9 research outputs found

    Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping

    No full text
    This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication to computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus considering the iteration space bounds. Our schedule considerably reduces overall completion time under the assumption that some part from every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are not anymore interleaved with sends and receives to non-local processors. Experimental results in a cluster of Pentiums by using various MPI send primitives show that the total completion time is significantly reduced. Index Terms—Loop tiling, communication overlapping, supernodes, hyperplanes, MPI send-receive primitives.

    Enhancing the Performance of Tiled Loop Execution onto Clusters using Memory Mapped Network Interfaces and Pipelined Schedules

    No full text
    This paper describes the performance benefits attained using enhanced network interfaces to achieve low latency communication. Our experimental testbed concerns the parallel execution of tiled nested loops onto a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles are necessarily exchanging data and should also have large computational grain, so that their parallel execution becomes beneficial. We schedule tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. The applied nonblocking schedule resembles a pipelined datapath where computation phases are overlapped with communication ones, instead of being interleaved with them. We are using DMA communication mode, to remote write (send) data to other nodes, while the host CPU is computing all iterations within each tile. We achieve zero-copy communication through pinned-down physical memory regions for DMA (PCI exported segments to SCI global space). Results illustrate that when using enhanced communication features such as DMA transfers, memorymapped interfaces and zero-copy mechanisms, overall performance is considerably enhanced than when typically using conventional, CPU and kernel bounded, communication primitives.

    A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

    No full text
    This paper proposes a novel approach for the parallel execution of tiled Iteration Spaces onto a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory mapped PCI-SCI Network Interface Card. We apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. In this way, intranode (intragroup) communication is annihilated. Groups are atomically executed inside each node. Nodes exchange data between successive group computations. We schedule groups much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive atomic group executions. The applied non-blocking schedule resembles a pipelined datapath where group computation phases are overlapped with communication ones, instead of being interleaved with them. Our experimental results illustrate that the proposed method outperforms previous approaches involving blocking communication or conventional grouping schemes

    Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs Using Memory Mapped Network Interfaces

    No full text
    This paper describes the performance benefits attained using enhanced network interfaces to achieve low latency communication. We present a novel, pipelined scheduling approach which takes advantage of DMA communication mode, to send data to other nodes, while the CPUs are performing calculations. We also use zero-copy communication through pinned-down physical memory regions, provided by NIC’s driver modules. Our testbed concerns the parallel execution of tiled nested loops onto a cluster of SMP nodes with single PCI-SCI NICs inside each node. In order to schedule tiles, we apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. Experimental evaluation illustrates that memory mapped NICs with enhanced communication features enable the use of a more advanced pipelined (overlapping) schedule, which considerably improves performance, compared to an ordinary blocking schedule, implemented with conventional, CPU and kernel bounded, communication primitives. Keywords: memory mapped network interfaces, DMA, pipelined schedules, tile grouping, communication overlapping, SMPs
    corecore